Abstract

This paper describes a new algorithm, the P&A algorithm, for
identifying overlapping communities in undirected weighted graphs,
regardless of their number or size. The complexity of this algorithm
is low in the sense that the number of operations grows linearly with
the number of vertices.

1 Introduction 

Several algorithms have already been proposed for computing graph
partitions. They fall into two categories: agglomerative methods,
which compute distances between vertices and then merge the nearest
points; and divisive methods, which start from the whole graph and
repeatedly delete arcs, thereby splitting the graph into new
connected components.

Nevertheless, in some applications of these graph partitioning
methods, strict partitions, in which each node of the graph is
assigned to a single cluster, may be unsuitable. Our research on
graph partitioning is applied to the field of management, where the
final objective is, roughly speaking, to discover "professional
communities" from the analysis of graphs built from the logs of
electronic exchanges between employees. In this type of application,
a desirable property is to obtain "overlapping partitions" of the
graphs, i.e. partitions in which a node (representing an employee of
the company in our application) can be associated with several
communities. For instance, in companies with a matrix organisation
structure, it is common for employees to take part in several
projects and thus to belong to several professional communities
simultaneously.

Only a few algorithms are natively able to find overlapping
communities [09] [04]. Nevertheless, such results can also be
obtained by aggregating partitions computed from arbitrarily modified
data [02] or computed by non-deterministic algorithms.

2 Our algorithm 

The proposed algorithm is neither divisive nor agglomerative. It
belongs to a new kind of two-phase algorithm that first extracts
subsets of linked points from the graph and then merges those subsets
into stable communities. We call it the Pull & Aggregate algorithm
(P&A algorithm).

2.1 Extracting subsets of points... 

This first phase lists subsets of vertices that are closely linked to
one another; these subsets are used in the second phase to build
communities. Several methods can be used to find those subsets, for
instance the simulation of random walks. We chose another method,
deterministic and computationally inexpensive: the barycentric
method.

To picture the system physically, imagine a set of pivots
representing the vertices, linked to one another by springs of zero
rest length whose stiffness is given by the value of the arc. The
principle is to apply to this physical system a number of specific or
uniform external forces. To compute the equilibrium positions, we use
the work of Huberman on electric potentials [06]. It shows that an
iterative algorithm of successive weighted averages solves
Kirchhoff's equations in a number of operations that does not depend
on the number of vertices. We use this type of algorithm to compute
the positions of the pivots in the dimensions in which only specific
forces are exerted, and a slightly modified algorithm of a comparable
nature when uniform forces must also be taken into account.
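
The link between spring equilibrium and weighted averages can be
written out explicitly. As a sketch, assume one coordinate x_i per
pivot, a stiffness k_ij on the arc between i and j, and N(i) the set of
neighbours of i (this notation is introduced here for illustration and
is not taken from the original). With springs of zero rest length, the
net force on a mobile pivot vanishes when:

```latex
\sum_{j \in N(i)} k_{ij}\,(x_j - x_i) = 0
\quad\Longrightarrow\quad
x_i = \frac{\sum_{j \in N(i)} k_{ij}\, x_j}{\sum_{j \in N(i)} k_{ij}}
```

Each mobile pivot therefore sits at the stiffness-weighted average of
its neighbours, and repeating this update over all mobile pivots is the
iterative scheme of successive weighted averages.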

For large graphs, we work in two dimensions: a dimension X (specific
forces) and a dimension Z (specific and uniform forces). For smaller
graphs, some adaptations are necessary (addition of a dimension Y,
choice of several poles), but they are beyond the scope of this
article.

2.1.1 Equilibrium position according to X 

For dimension X, we consider each vertex of the graph in turn. At
step i, we call this point Ai and fix its value to 1. We also fix at
0 all the vertices that are more than n steps away or, in the case of
weighted graphs, attenuated beyond a breakpoint x. To find these
points quickly, we use the BFS algorithm for unweighted arcs and
Dijkstra's algorithm for weighted arcs. All the other points are
mobile. We then repeatedly compute, by weighted averages, the
position of all the mobile points until a steady equilibrium is
reached. Huberman shows that this phase of the algorithm runs in a
time that does not depend on n. The computation rests on an analogy
between the laws governing the electrical current at each node of an
electronic circuit and the principle of equilibrium of forces: the
sum of the forces is zero at each node, as is the sum of the
currents; the current I = (1/R)*U corresponds to the spring force
F = k*L; and voltage (difference of potential) plays the role of
distance (difference of length).
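
The X phase for a single pole can be sketched as follows. This is a
minimal illustration, not the authors' implementation: it handles the
unweighted-distance case with BFS only, it assumes that far vertices
are pinned at 0 and the pole at 1, and the names `bfs_within`,
`relax_x`, `tol` and `max_iter` are hypothetical.

```python
from collections import deque


def bfs_within(adj, pole, n_steps):
    """Return the vertices reachable from `pole` in at most `n_steps` hops."""
    dist = {pole: 0}
    queue = deque([pole])
    while queue:
        u = queue.popleft()
        if dist[u] == n_steps:
            continue
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return set(dist)


def relax_x(adj, pole, n_steps=2, tol=1e-6, max_iter=1000):
    """X coordinate of every vertex for one pole.

    adj maps each vertex to a dict {neighbour: arc value}.
    Assumption: vertices beyond n_steps are pinned at 0, the pole at 1.
    """
    mobile = bfs_within(adj, pole, n_steps) - {pole}
    x = {v: 0.0 for v in adj}
    x[pole] = 1.0
    for _ in range(max_iter):
        biggest_move = 0.0
        for v in mobile:
            # Weighted average of the neighbours' current positions.
            total = sum(adj[v].values())
            new = sum(w * x[u] for u, w in adj[v].items()) / total
            biggest_move = max(biggest_move, abs(new - x[v]))
            x[v] = new
        if biggest_move < tol:
            break
    return x
```

On a path 0-1-2-3 with unit arc values, pole 0 and n_steps=2, vertices
1 and 2 are mobile and settle near 2/3 and 1/3, while vertex 3 stays
pinned at 0.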

2.1.2 Equilibrium position according to Z 

The innovation compared with basic barycentric methods lies in this
phase. In some cases, points that have nothing to do with the cluster
drawn towards the pole end up at the same position [see figure 1]. If
we rely only on a computation of distances to determine the contents
of the clusters, such points are wrongly added.

Figure 1: Diagram of a subset including a point located inopportunely at
the same position



To locate these points, the idea is to apply an external force along
Z that indicates whether each point is, overall, tied to the points
close to it along dimension X. Computing the equilibrium positions of
systems that include uniform external forces requires solving systems
of equations, which implies a high complexity. We are thus forced to
depart from physical reality and to propose a model that has the same
advantages but can be solved with linear complexity. We keep the same
definition of fixed and mobile points as in the computation of the X
coordinates. The value of the fixed points is 1, and the mobile
points are also initialized to 1. We repeatedly compute the Z
coordinate of each mobile point as an average weighted by the value
of the links and the distance of the neighbours. The difference
between the position at the preceding iteration and the new position
is denoted d1. To represent the uniform force, we apply an additional
displacement d2 to the mobile point.

The additional displacement d2 depends on X and Z:
d2 (X, Z) = d2x (X) * d2z (Z)

Several choices of functions are possible for d2x and d2z. d2x is
chosen so that only the points most likely to belong to the community
of the pole are subject to the external force, and d2z so that the
points tend to stick to the boundaries. The displacement along Z is
limited to the interval between 0 and 1: any final position outside
this interval is brought back to the nearest bound.

For practical purposes, we retained a sigmoid function for d2x [see
figure 2] and a square function for d2z [see figure 3].
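
One Z update can be sketched as below. The text does not give the exact
shapes of the two functions, so the logistic curve, the form
d2z(Z) = Z^2, the downward sign of the push, the weighting by link
values only, and the parameter names `centre`, `steepness` and
`strength` are all illustrative assumptions rather than the authors'
choices.

```python
import math


def d2x(x, centre=0.5, steepness=10.0):
    """Sigmoid of X: close to 1 near the pole, close to 0 far from it."""
    return 1.0 / (1.0 + math.exp(-steepness * (x - centre)))


def d2z(z):
    """Square function of Z (assumed form)."""
    return z * z


def update_z(v, adj, x, z, strength=0.1):
    """One Z update for mobile vertex v; returns the clamped new coordinate.

    adj maps each vertex to {neighbour: arc value}; x and z map each
    vertex to its current X and Z coordinates.
    """
    total = sum(adj[v].values())
    average = sum(w * z[u] for u, w in adj[v].items()) / total
    d1 = average - z[v]                       # displacement from averaging
    d2 = -strength * d2x(x[v]) * d2z(z[v])    # assumed downward push
    # Positions outside [0, 1] are brought back to the nearest bound.
    return min(1.0, max(0.0, z[v] + d1 + d2))
```

With these assumed shapes, a point with a high X coordinate is pushed
away from 1 at each step, while a point with a low X coordinate is
almost unaffected by d2 and follows its neighbours.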



Figure 2: Sigmoid function 



Figure 3: Square function 



The parameters of the sigmoid function depend on the number of points
expected in our communities. If the estimated size of the communities
is completely unknown or highly variable, we can also test several
values and retain the one just before a jump in the number of points
selected in the subset [see figure 4].

Figure 4: Diagram of the number of effectively selected points according
to the estimated size of communities
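
The selection of the estimated size "just before a jump" can be
sketched as follows. The text does not define the jump precisely; this
sketch assumes it is the largest increase in the number of selected
points between two consecutive tested values, and the function name
`size_before_jump` is hypothetical.

```python
def size_before_jump(counts):
    """counts: list of (estimated_size, n_selected), sorted by estimated_size.

    Returns the estimated size just before the largest increase in
    n_selected (assumed definition of the "jump").
    """
    best_size, best_jump = counts[0][0], 0
    for (size, n), (_, n_next) in zip(counts, counts[1:]):
        if n_next - n > best_jump:
            best_size, best_jump = size, n_next - n
    return best_size
```

For example, with made-up counts [(5, 6), (10, 9), (15, 10), (20, 31),
(25, 33)], the largest increase occurs between sizes 15 and 20, so the
value retained is 15.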



At equilibrium, the subset of points to retain consists of all the
points whose Z coordinate is lower than a threshold [see figure 5].




Figure 5: Diagram of the same subset including a point located
inopportunely at the same position, but this time with Z coordinates



Note: We have no theoretical proof that the algorithm using such
functions converges, let alone in a time that does not depend on n.
Nevertheless, repeated trials indicate that the Z coordinates
converge in a few tens of iterations, whatever the size of the
problem.

2.2 Gathering the subsets of points in stable communities

The second part of the algorithm is inspired by an algorithm of
Huberman [05], modified to account for the fact that our input is not
a list of partitions of a graph but only subsets of points forming
subdivisions of communities.

2.2.1 Determination of attractive poles for initialization

The aim is to define, in Huberman's words, a "basic masterList", a
kind of skeleton of the main components of each community. We proceed
by simple counting. Suppose that, over all the subsets, item X is
present n times, and that this is the maximum frequency of any point.
For each other point Y, we count how many times Y is in the same
subset as X. We group together the points Yi that are together with X
more than 50% of the time: this forms the first attractive pole.
Among the remaining points, we choose the point with the most
occurrences and proceed in the same way to determine the second
attractive pole. We continue until the list is empty or until no
newly found attractive pole has a sufficient size (that is, its size
differs significantly from the size of the attractive poles
previously found). The points remaining at the end, not associated
with any attractive pole, are either points weakly related to the
communities or points related equally to several communities. In
either case, they will be difficult to classify.
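
The counting procedure can be sketched as follows. Two points are
assumptions of this sketch rather than statements of the paper: the 50%
rule is read as "Y appears with the seed in more than half of Y's own
occurrences", and the stopping rule on pole size is simplified to a
fixed `min_size`. The names `attractive_poles`, `share` and `min_size`
are hypothetical.

```python
from collections import Counter


def attractive_poles(subsets, share=0.5, min_size=2):
    """Build the attractive poles by counting co-occurrences.

    subsets: list of sets of vertices produced by the first phase.
    min_size is a simplified stopping rule (an assumption of this sketch).
    """
    freq = Counter(v for s in subsets for v in s)
    remaining = set(freq)
    poles = []
    while remaining:
        # Most frequent remaining point (ties broken by sort order).
        seed = max(sorted(remaining), key=lambda v: freq[v])
        together = Counter(v for s in subsets if seed in s for v in s)
        pole = {v for v in remaining
                if v != seed and together[v] > share * freq[v]}
        pole.add(seed)
        if len(pole) < min_size:
            break
        poles.append(pole)
        remaining -= pole
    return poles
```

Points still in `remaining` when the loop stops belong to no pole; they
correspond to the hard-to-classify points mentioned in the text.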

2.2.2 Fastening of the subsets of points to the attractive poles

The rest of the algorithm is almost identical to Huberman's, except
that the formula used to measure proximity takes into account the
non-homogeneity of the number of occurrences of the points: in the
case treated by Huberman, every point has the same number of
occurrences because a point appears exactly once in each partition.
For each group of points, we thus compute its proximity to each
community of the masterList, then merge the group into its nearest
community. This operation has a complexity of O(n), because we
restrict the computation of proximity to the communities that include
at least one of the points of the group. The value of t is then the
number of times the element of the masterList was selected for
merging. Once all the groups of points have been merged with the
attractive poles, we remove the points artificially defined as
attractive poles at initialization, then compute the relative share
of presence of each point in each community.
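
The fastening step can be sketched as follows. The paper does not state
its proximity formula, so the score used here (occurrences shared with
the group, divided by the total occurrences in the community) is only a
placeholder assumption, and the function name `fasten` is hypothetical.

```python
from collections import Counter


def fasten(subsets, poles):
    """Merge each subset into its nearest community and return memberships.

    Returns {point: {community index: share of the point's occurrences}}.
    The proximity formula below is a placeholder assumption.
    """
    communities = [Counter(pole) for pole in poles]   # artificial seed counts
    for group in subsets:
        best, best_score = None, 0.0
        for idx, com in enumerate(communities):
            shared = sum(com[v] for v in group)
            if shared == 0:          # no common point: skip this community
                continue
            score = shared / sum(com.values())
            if score > best_score:
                best, best_score = idx, score
        if best is not None:
            communities[best].update(group)
    membership = {}
    for idx, (com, pole) in enumerate(zip(communities, poles)):
        com.subtract(pole)           # remove the artificial seed counts
        for v, n in com.items():
            if n > 0:
                membership.setdefault(v, {})[idx] = n
    for shares in membership.values():
        total = sum(shares.values())
        for idx in shares:
            shares[idx] /= total
    return membership
```

With the subsets [{1, 2, 3}, {1, 2, 3}, {1, 2}, {4, 5}, {4, 5},
{4, 5, 3}] and the poles [{1, 2, 3}, {4, 5}], point 3 ends up with a
share of 2/3 in the first community and 1/3 in the second, while the
other points belong entirely to one community.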

Note: This phase of fastening to attractive poles may sometimes
produce different results depending on the order in which the subsets
are presented. Since this order is arbitrary, we repeat the
aggregation a large number of times with varying orders of
presentation, then average the results per attractive pole in order
to obtain stable results (the list of attractive poles is the same
regardless of the order of presentation).
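
The averaging over presentation orders can be sketched as follows. The
fastening step itself is passed in as the callable `aggregate`, so this
sketch makes no assumption about how proximity is computed; the names
`average_memberships`, `n_runs` and `seed` are hypothetical.

```python
import random


def average_memberships(subsets, poles, aggregate, n_runs=100, seed=0):
    """Average the memberships returned by `aggregate` over shuffled orders.

    aggregate(subsets, poles) must return {point: {community: share}};
    it stands for the fastening step and is a parameter of this sketch.
    """
    rng = random.Random(seed)
    totals = {}
    order = list(subsets)
    for _ in range(n_runs):
        rng.shuffle(order)
        for point, shares in aggregate(order, poles).items():
            acc = totals.setdefault(point, {})
            for com, share in shares.items():
                acc[com] = acc.get(com, 0.0) + share
    return {point: {com: s / n_runs for com, s in acc.items()}
            for point, acc in totals.items()}
```

Because the list of poles is the same for every run, the community
indices are comparable across runs and can be averaged directly.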

3 Applications 

We briefly show the results obtained on simple graphs of medium size
(a few hundred points). These graphs are unweighted and do not have,
a priori, a structure of overlapping communities. In these two cases
we therefore use our algorithm for partitioning. Since it produces
fuzzy communities, we take each partition to consist of the points
whose membership rate is highest in the corresponding community.
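
Turning fuzzy memberships into a strict partition amounts to keeping,
for each point, the community with the highest rate. A minimal sketch,
assuming memberships are stored as {point: {community: rate}} (a
representation chosen here for illustration):

```python
def strict_partition(membership):
    """Map each point to the community where its membership is highest."""
    return {point: max(shares, key=shares.get)
            for point, shares in membership.items()}
```

For instance, a point with rates {0: 0.7, 1: 0.3} is assigned to
community 0.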

3.1 Computer-generated graphs 

We test the performance of our P&A algorithm on networks built with
192 vertices, divided into 12 separate communities of equal size,
with an average vertex degree of 16. We vary the number of
intra-community connections and determine how well our algorithm cuts
the graph in each case. To do so, we define a "success rate" as the
percentage of agreement between the partition produced by the
algorithm and the one defined by the parameters used to build the
computer-generated graph.
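
A benchmark of this kind can be generated along the following lines.
This is a sketch under stated assumptions: edges are drawn
independently, with probabilities chosen so that each vertex has on
average z_in intra-community links and 16 - z_in inter-community links,
and the success rate is read as the share of vertices correctly
classified after matching each found community to its majority planted
label. The names `planted_graph` and `success_rate` are hypothetical.

```python
import random


def planted_graph(z_in, n_com=12, size=16, degree=16, seed=0):
    """Random graph with n_com equal communities.

    z_in is the expected number of intra-community links per vertex;
    the expected number of inter-community links is degree - z_in.
    Returns (edges, labels). The independent-edge model is an assumption.
    """
    rng = random.Random(seed)
    n = n_com * size
    labels = [v // size for v in range(n)]
    p_in = z_in / (size - 1)
    p_out = (degree - z_in) / (n - size)
    edges = []
    for u in range(n):
        for v in range(u + 1, n):
            p = p_in if labels[u] == labels[v] else p_out
            if rng.random() < p:
                edges.append((u, v))
    return edges, labels


def success_rate(found, labels):
    """Share of vertices whose found community matches the planted one.

    Each found community is matched to the planted label most frequent
    among its members (one possible reading of the "success rate").
    """
    members = {}
    for v, com in enumerate(found):
        members.setdefault(com, []).append(labels[v])
    hits = sum(max(lab.count(l) for l in set(lab)) for lab in members.values())
    return hits / len(labels)
```

With the default parameters, 12 communities of 16 vertices give the 192
vertices of the text; `found` is a list giving the community assigned
to each vertex by the algorithm under test.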

As figure 6 shows, the algorithm works perfectly when the number of
intra-community connections is greater than the number of
inter-community connections (intra 8.5, inter 7.5, success rate:
100%). This is a fairly good result for the algorithm.



[Figure 6 plot: success rate (vertical axis) against the number of
intra-community connections (horizontal axis, 2 to 16)]

Figure 6: Success rate of the P&A algorithm as a function of the
intra-community connections of computer-generated graphs



3.2 Football Championship 

We tested our algorithm on the "College football 2000" data file,
which has already been analyzed by, among others, Newman [01] and
Radicchi [07]. We chose the spring as the force model and a
neighbourhood of 2 steps for determining the mobile points.

See figure 7 for the results obtained by the P&A algorithm (noted
PA), compared with those of Newman (noted GN) and Radicchi (noted
RA), and figure 8 for the complete results.

From these results, we can read for instance that node 84 belongs to
community 2 at 87% and to community 8 at 13%. We can thus locate the
nodes that belong to two or more communities, which strict
partitioning algorithms would arbitrarily allot to one of them. For
example, according to the P&A algorithm, node 50 belongs at 54% to
community H and at 44% to community J. Both the GN and RA algorithms
allot it to community J, but they could just as well have allotted it
to H. We also notice that the P&A algorithm classifies all the
points, unlike the two other algorithms.
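
Locating such multi-community nodes from the membership percentages can
be sketched as follows. The cut-off `min_share` and the function name
`overlapping_nodes` are hypothetical; only the percentages for nodes 84
and 50 come from the text.

```python
def overlapping_nodes(membership, min_share=0.10):
    """Nodes holding at least min_share in two or more communities."""
    result = {}
    for node, shares in membership.items():
        strong = {c: s for c, s in shares.items() if s >= min_share}
        if len(strong) >= 2:
            result[node] = strong
    return result


# Values quoted in the text for nodes 84 and 50.
membership = {84: {2: 0.87, 8: 0.13}, 50: {"H": 0.54, "J": 0.44}}
```

With the default cut-off both nodes are reported; raising `min_share`
to 0.2 keeps only node 50, whose two memberships are nearly balanced.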

Figure 7: Comparison with GN's algorithm and Radicchi's

Figure 8: Complete results with membership percentages

4 Conclusion and outlook



The P&A algorithm, in addition to its low complexity, which makes it
well suited to computations on very large graphs, has a number of
other advantages.

First of all, the P&A algorithm can be implemented and run in
parallel, at least for the subset extraction phase, which accounts
for most of the computation: different machines or processors can
process their batch of poles completely independently.
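
Because each pole is processed independently, the extraction phase maps
directly onto a worker pool. A minimal sketch, in which
`extract_subset` is a caller-supplied function standing for the whole
first phase for one pole (it is not defined in the paper) and threads
are used only for simplicity:

```python
from concurrent.futures import ThreadPoolExecutor


def extract_all(poles, extract_subset, max_workers=4):
    """Run extract_subset(pole) for every pole, independently of the others.

    extract_subset stands for the whole first phase for one pole and is
    supplied by the caller. Threads are used here for simplicity; a
    process pool or several machines follow the same pattern.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(extract_subset, poles))
```

The results come back in the same order as the poles, so the list of
subsets can be handed to the second phase unchanged.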

Moreover, since the P&A algorithm relies on local behaviour, it can
detect the communities of delimited parts of the graph without
examining the whole graph. To do so, we only need to select, as
poles, the points we want to associate with a community along with
their n-step neighbours.

Lastly, the P&A algorithm is more easily adaptable than purely
mathematical methods: thresholds and functions can be chosen
(specific forces: springs, rubber bands, models with rupture...;
uniform forces: linear, sinusoid, sigmoid...) while effective default
values are preserved. It is thus particularly suited to networks with
complex interactions (social networks, for example). It should
nevertheless be noted that, depending on the form of the forces,
there is no formal proof that the algorithm converges, let alone in
linear time. We can only observe this on practical cases.

To go further, we will soon present a larger number of applications
of this algorithm to real weighted graphs that include overlapping
communities, such as social networks derived from communication
graphs. We will also apply other force models that may carry
sociological meaning.